
    Incorporating unobserved heterogeneity and multiple event types in survival models: a Bayesian approach

    This thesis covers theoretical and practical aspects of Bayesian inference and survival analysis, a powerful tool for analysing the time until an event of interest occurs. The dissertation focuses on non-standard models inspired by features of real datasets that are not accommodated by conventional models. The material is divided into two parts. The first and more extensive part concerns the development of flexible parametric lifetime distributions, motivated by the presence of anomalous observations and other forms of unobserved heterogeneity. Chapter 2 presents the use of mixture families of lifetime distributions for this purpose. This construction can be interpreted as introducing an observation-specific random effect on the survival distribution. Two families generated via this mechanism are studied in Chapter 3. Covariates are introduced through an accelerated failure time representation, for which the interpretation of the regression coefficients is invariant to the distribution of the random effect. The Bayesian model is completed using reasonable (improper) priors that require minimal input from practitioners. Under mild conditions, these priors induce a well-defined posterior distribution. In addition, the mixture structure is exploited to propose a novel method for outlier detection, in which anomalous observations are identified via the posterior distribution of the individual-specific random effects. The analysis is illustrated in Chapter 4 using three real medical applications. Chapter 5 comprises the second part of the thesis and is motivated by the study of university outcomes. The aim is to identify determinants of the length of stay at university and the associated academic outcome for undergraduate students of the Pontificia Universidad Católica de Chile. In this setting, survival times are defined as the time until the end of the enrollment period, which can end for different reasons (graduation or dropout) driven by different processes; hence, a competing risks model is employed for the analysis. Model uncertainty is handled through Bayesian model averaging, which leads to better predictive performance than choosing a single model. The output of this analysis does not account for all features of this complex dataset, yet it provides a better understanding of the problem and a starting point for future research. Finally, Chapter 6 summarizes the main findings of this work and suggests future extensions.
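    As a rough sketch of the mixture construction described above (illustrative notation only, not taken from the thesis): conditioning on an observation-specific random effect and then marginalising over its distribution yields a flexible family of survival functions, with covariates entering through an accelerated failure time factor.

```latex
% Illustrative sketch: conditional on a subject-specific random effect \lambda_i,
% the survival time follows a baseline lifetime distribution; marginalising over
% \lambda_i generates the mixture family, and covariates act through an
% accelerated failure time term whose interpretation does not depend on P.
\begin{align*}
  T_i \mid \lambda_i &\sim F_0(\,\cdot\,;\lambda_i), \qquad \lambda_i \overset{iid}{\sim} P,\\[2pt]
  S(t \mid x_i) &= \int S_0\!\left(e^{-x_i^\top \beta}\, t ;\, \lambda\right) \mathrm{d}P(\lambda).
\end{align*}
```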

    Bayesian survival modelling of university outcomes

    Dropouts and delayed graduations are critical issues in higher education systems worldwide. A key task in this context is to identify risk factors associated with these events, providing potential targets for mitigating policies. For this, we employ a discrete-time competing risks survival model, dealing simultaneously with university outcomes and their associated temporal component. We define survival times as the duration of the student's enrolment at university and the possible outcomes as graduation or two types of dropout (voluntary and involuntary), exploring information recorded at admission time (e.g. the educational level of the parents) as potential predictors. Although similar strategies have been implemented previously, we extend them by handling covariate selection within a Bayesian variable-selection framework, where model uncertainty is formally addressed through Bayesian model averaging. Our methodology is general; however, here we focus on undergraduate students enrolled in three selected degree programmes of the Pontificia Universidad Católica de Chile during the period 2000–2011. Our analysis reveals interesting insights, highlighting the main covariates that influence students' risk of dropout and delayed graduation.
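    A minimal sketch of the discrete-time competing risks idea on toy data, using a person-period expansion and an off-the-shelf multinomial logistic hazard. The data, variable names and the sklearn fit are illustrative assumptions; the paper itself uses a Bayesian variable-selection and model-averaging formulation.

```python
# Sketch: discrete-time competing risks via a multinomial logistic hazard on
# person-period data (illustrative only; not the authors' implementation).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy student-level data: observed duration (semesters) and final outcome.
# Outcomes: 0 = still enrolled (censored), 1 = graduation,
# 2 = voluntary dropout, 3 = involuntary dropout. x is, say, an admission score.
n = 500
students = pd.DataFrame({
    "duration": rng.integers(1, 13, size=n),
    "outcome": rng.integers(0, 4, size=n),
    "x": rng.normal(size=n),
})

# Expand to person-period format: one row per student per period at risk;
# the event indicator is 0 until the final period, where it takes the outcome.
rows = []
for _, s in students.iterrows():
    for t in range(1, int(s["duration"]) + 1):
        event = int(s["outcome"]) if t == s["duration"] else 0
        rows.append({"period": t, "x": s["x"], "event": event})
pp = pd.DataFrame(rows)

# Multinomial logistic hazard: P(event type | at risk in period t, x).
X = pp[["period", "x"]].to_numpy()
model = LogisticRegression(max_iter=1000).fit(X, pp["event"])
print(model.classes_, model.coef_.shape)
```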

    Incorporating unobserved heterogeneity in Weibull survival models: a Bayesian approach

    Outlying observations and other forms of unobserved heterogeneity can distort inference for survival datasets. The family of Rate Mixtures of Weibull distributions includes subject-level frailty terms as a solution to this issue. With a parametric mixing distribution assigned to the frailties, this family generates flexible hazard functions. Covariates are introduced via an Accelerated Failure Time specification, for which the interpretation of the regression coefficients does not depend on the choice of mixing distribution. A weakly informative prior is proposed by combining the structure of the Jeffreys prior with a proper prior on some model parameters. This improper prior is shown to lead to a proper posterior distribution under easily satisfied conditions. By eliciting the proper component of the prior through the coefficient of variation of the survival times, prior information is matched across different mixing distributions. Posterior inference on the subject-level frailty terms is exploited as a tool for outlier detection. Finally, the proposed methodology is illustrated using two real datasets, one concerning bone marrow transplants and another on cerebral palsy.
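    A minimal simulation sketch of the model structure (the parametrisation and variable names are assumptions for illustration, not the paper's notation): conditional on a subject-level frailty, the survival time is Weibull with its rate scaled by the frailty, and covariates act multiplicatively on time through the AFT factor.

```python
# Sketch: simulating from a Rate Mixture of Weibull under an AFT specification.
import numpy as np

rng = np.random.default_rng(1)

n = 1000
shape = 1.5                       # Weibull shape parameter
beta = np.array([0.5, -0.3])      # AFT regression coefficients
X = rng.normal(size=(n, 2))       # covariates

# Frailty (mixing) distribution, here Gamma with mean 1; heavier-tailed choices
# inflate the chance of extreme survival times (potential outliers).
lam = rng.gamma(shape=2.0, scale=0.5, size=n)

# Conditional on lambda_i, S(t | lambda_i) = exp(-lambda_i * t^shape), so
# T_i = (-log(U) / lambda_i)^(1/shape); the AFT factor exp(x_i' beta) scales time.
u = rng.uniform(size=n)
t_baseline = (-np.log(u) / lam) ** (1.0 / shape)
T = np.exp(X @ beta) * t_baseline

# Subjects with unusually small frailties generate unusually long survival times;
# in the Bayesian analysis such subjects are flagged via the posterior of lambda_i.
print(T.mean(), T.max())
```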

    A review on competing risks methods for survival analysis

    When modelling competing risks survival data, several techniques have been proposed in both the statistical and machine learning literature. State-of-the-art methods extend classical approaches with more flexible assumptions that can improve predictive performance and accommodate high-dimensional data and missing values, among other features. Despite this, modern approaches have not been widely employed in applied settings. This article aims to aid the uptake of such methods by providing a condensed compendium of competing risks survival methods with a unified notation and interpretation across approaches. We highlight available software and, when possible, demonstrate its usage via reproducible R vignettes. Moreover, we discuss two major concerns that can affect benchmark studies in this context: the choice of performance metrics and reproducibility.
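    As a small companion to the review's scope, here is a sketch of a nonparametric cumulative incidence (Aalen-Johansen type) estimator on toy data. It is illustrative only; the review itself points readers to established R packages and reproducible vignettes.

```python
# Sketch: cumulative incidence functions for two competing event types.
import numpy as np

rng = np.random.default_rng(2)
n = 200
time = rng.exponential(scale=5.0, size=n).round(2)
# cause: 0 = censored, 1 and 2 = competing event types
cause = rng.choice([0, 1, 2], size=n, p=[0.3, 0.4, 0.3])

def cumulative_incidence(time, cause, k):
    """CIF for event type k: sum over event times of S(t-) * d_k / n_at_risk."""
    surv = 1.0          # overall (all-cause) survival just before current time
    cif = 0.0
    out_t, out_cif = [], []
    for t in np.unique(time):               # unique times in increasing order
        at_risk = np.sum(time >= t)
        d_any = np.sum((time == t) & (cause > 0))
        d_k = np.sum((time == t) & (cause == k))
        cif += surv * d_k / at_risk
        surv *= 1.0 - d_any / at_risk
        out_t.append(t)
        out_cif.append(cif)
    return np.array(out_t), np.array(out_cif)

t1, cif1 = cumulative_incidence(time, cause, k=1)
t2, cif2 = cumulative_incidence(time, cause, k=2)
print(cif1[-1] + cif2[-1])   # total incidence across causes, at most 1
```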

    Correcting the Mean-Variance Dependency for Differential Variability Testing Using Single-Cell RNA Sequencing Data.

    Cell-to-cell transcriptional variability in otherwise homogeneous cell populations plays an important role in tissue function and development. Single-cell RNA sequencing can characterize this variability in a transcriptome-wide manner. However, technical variation and the confounding between variability and mean expression estimates hinder meaningful comparison of expression variability between cell populations. To address this problem, we introduce an analysis approach that extends the BASiCS statistical framework to derive a residual measure of variability that is not confounded by mean expression. This includes a robust procedure for quantifying technical noise in experiments where technical spike-in molecules are not available. We illustrate how our method provides biological insight into the dynamics of cell-to-cell expression variability, highlighting a synchronization of biosynthetic machinery components in immune cells upon activation. In contrast to the uniform up-regulation of the biosynthetic machinery, CD4+ T cells show heterogeneous up-regulation of immune-related and lineage-defining genes during activation and differentiation. NE was funded by the European Molecular Biology Laboratory (EMBL) international PhD programme. ACR was funded by the MRC Skills Development Fellowship (MR/P014178/1). SR was funded by MRC grant MC_UP_0801/1. JCM was funded by core support of Cancer Research UK and EMBL. CAV was funded by The Alan Turing Institute, EPSRC grant EP/N510129/1.
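    The core idea of a mean-independent variability measure can be caricatured outside the full Bayesian model: regress a per-gene variability estimate on mean expression and keep the residuals. The sketch below uses simulated counts and a simple polynomial trend as stand-ins (assumptions for illustration); the actual method extends the BASiCS hierarchical model, implemented as an R/Bioconductor package.

```python
# Sketch: a "residual" variability measure that removes the mean-variability trend.
import numpy as np

rng = np.random.default_rng(3)

n_genes, n_cells = 2000, 300
mean_expr = rng.lognormal(mean=1.0, sigma=1.0, size=n_genes)
# Simulate counts with gene-specific overdispersion that trends with the mean.
dispersion = 0.5 + 2.0 / mean_expr + rng.gamma(2.0, 0.1, size=n_genes)
counts = rng.negative_binomial(
    n=1.0 / dispersion[:, None],
    p=1.0 / (1.0 + dispersion[:, None] * mean_expr[:, None]),
    size=(n_genes, n_cells),
)

# Empirical mean and variability per gene.
m = counts.mean(axis=1)
v = counts.var(axis=1)
log_cv2 = np.log((v + 1e-8) / (m + 1e-8) ** 2)   # log squared coefficient of variation

# Fit a simple trend of variability on mean and take residuals: genes with
# positive residuals are more variable than expected for their mean expression.
trend = np.polyfit(np.log(m + 1e-8), log_cv2, deg=2)
residual_variability = log_cv2 - np.polyval(trend, np.log(m + 1e-8))
print(residual_variability[:5])
```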

    Model updating after interventions paradoxically introduces bias

    Machine learning is increasingly being used to generate prediction models for use in a number of real-world settings, from credit risk assessment to clinical decision support. Recent discussions have highlighted potential problems in the updating of a predictive score for a binary outcome when an existing predictive score forms part of the standard workflow, driving interventions. In this setting, the existing score induces an additional causative pathway, which leads to miscalibration when the original score is replaced. We propose a general causal framework to describe and address this problem, and demonstrate an equivalent formulation as a partially observed Markov decision process. We use this model to demonstrate the impact of such 'naive updating' when performed repeatedly. Namely, we show that successive predictive scores may converge to a point where they predict their own effect, or may eventually tend toward a stable oscillation between two values, and we argue that neither outcome is desirable. Furthermore, we demonstrate that even if model-fitting procedures improve, actual performance may worsen. We complement these findings with a discussion of several potential routes to overcome these issues. Note: sections of this preprint on 'successive adjuvancy' (Section 4, Theorem 2, Figures 4 and 5, and associated discussions) were not included in the originally submitted version of this paper due to length; this material does not appear in the published version of the manuscript, and these sections did not undergo peer review.
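    A toy simulation of the 'naive updating' loop conveys the flavour of the result (the numbers and the simple intervention rule are assumptions for illustration; the paper formalises this with a causal framework and a partially observed Markov decision process).

```python
# Toy illustration of naively refitting a risk score that itself drives interventions.
import numpy as np

r0 = 0.4          # true baseline event risk with no intervention
effect = 3.0      # strength with which acting on the score reduces realised risk
score = r0        # initial score is fit on pre-intervention data, so it starts correct

history = [score]
for step in range(20):
    # The deployed score triggers interventions, which reduce the realised
    # event rate below the no-intervention risk.
    realised_rate = r0 * max(0.0, 1.0 - effect * score)
    # Naive update: refit the score to post-intervention data, ignoring that the
    # observed rate was already altered by acting on the previous score.
    score = realised_rate
    history.append(score)

# Depending on the intervention strength, successive scores either converge to a
# point where they predict their own effect or settle into an oscillation between
# two values; either way they no longer estimate the no-intervention risk r0.
print(np.round(history, 3))
```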

    Normalizing single-cell RNA sequencing data: challenges and opportunities

    Single-cell transcriptomics is becoming an important component of the molecular biologist's toolkit. A critical step when analyzing data generated using this technology is normalization. However, normalization is typically performed using methods developed for bulk RNA sequencing or even microarray data, and the suitability of these methods for single-cell transcriptomics has not been assessed. Here we discuss commonly used normalization approaches and illustrate how these can produce misleading results. Finally, we present alternative approaches and provide recommendations for single-cell RNA sequencing users.
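    For concreteness, the sketch below shows the kind of bulk-style normalisation the commentary cautions about: per-cell library-size (counts-per-million) scaling followed by a log transform, on simulated counts. All values and names are illustrative assumptions.

```python
# Sketch: library-size (CPM) normalisation of a genes x cells count matrix.
import numpy as np

rng = np.random.default_rng(4)
counts = rng.poisson(lam=2.0, size=(1000, 50))   # toy count matrix: genes x cells

# Per-cell size factors from total counts (library size).
lib_size = counts.sum(axis=0)
cpm = counts / lib_size * 1e6                    # counts per million per cell
log_cpm = np.log1p(cpm)

# If a handful of highly expressed genes dominate some cells, their library sizes
# inflate and every other gene looks artificially down-regulated in those cells;
# dedicated single-cell methods estimate size factors more robustly.
print(log_cpm.shape, lib_size.min(), lib_size.max())
```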

    Integration of datasets for individual prediction of DNA methylation-based biomarkers

    BACKGROUND: Epigenetic scores (EpiScores) can provide biomarkers of lifestyle and disease risk. Projecting new datasets onto a reference panel is challenging because technical and biological variation are difficult to separate in array data. Normalisation can standardise data distributions but may also remove population-level biological variation. RESULTS: We compare two birth cohorts (Lothian Birth Cohorts of 1921 and 1936; nLBC1921 = 387 and nLBC1936 = 498) with blood-based DNA methylation assessed at the same chronological age (79 years) and processed in the same lab, but in different years and experimental batches. We examine the effect of 16 normalisation methods on a novel BMI EpiScore (trained in an external cohort, n = 18,413) and on Horvath's pan-tissue DNA methylation age, when the cohorts are normalised separately and together. The BMI EpiScore explains a maximum variance of R² = 24.5% in BMI in LBC1936 (SWAN normalisation). Although there are cross-cohort R² differences, the normalisation method makes minimal difference to within-cohort estimates. Conversely, a range of absolute differences is seen in individual-level EpiScore estimates for BMI and age when cohorts are normalised separately versus together. While within-array methods yield identical EpiScores whether a cohort is normalised on its own or together with the second dataset, a range of differences is observed for between-array methods. CONCLUSIONS: Normalisation methods that return similar EpiScores, whether cohorts are analysed separately or together, will minimise technical variation when projecting new data onto a reference panel. These methods are important in cases where raw data are unavailable and joint normalisation of cohorts is computationally expensive.
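    An EpiScore projection is, at heart, a weighted linear combination of CpG methylation beta-values, which is why normalisation shifts propagate directly into individual-level scores. The sketch below uses made-up CpG names and weights purely for illustration.

```python
# Sketch: projecting a trained linear EpiScore onto a new cohort.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# Hypothetical trained weights for a handful of CpGs, plus an intercept.
weights = pd.Series({"cg0000001": 1.2, "cg0000002": -0.8, "cg0000003": 0.5})
intercept = 0.1

# Toy beta-value matrix for a new cohort (samples x CpGs), values in [0, 1].
betas = pd.DataFrame(
    rng.uniform(0.0, 1.0, size=(10, 3)),
    columns=weights.index,
)

episcore = intercept + betas[weights.index] @ weights
print(episcore.round(3))

# A normalisation method that shifts beta-values by delta shifts every score by
# (weights * delta).sum(); within-array methods leave this unchanged whether a
# cohort is normalised alone or jointly, whereas between-array methods may not.
```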